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Abstract 

Screening rules allow to early discard irrelevant 
variables from the optimization in Lasso prob¬ 
lems, or its derivatives, making solvers faster. In 
this paper, we propose new versions of the so- 
called safe rules for the Lasso. Based on duality 
gap considerations, our new rules create safe test 
regions whose diameters converge to zero, pro¬ 
vided that one relies on a converging solver. This 
property helps screening out more variables, for 
a wider range of regularization parameter values. 
In addition to faster convergence, we prove that 
we correctly identify the active sets (supports) 
of the solutions in finite time. While our pro¬ 
posed strategy can cope with any solver, its per¬ 
formance is demonstrated using a coordinate de¬ 
scent algorithm particularly adapted to machine 
learning use cases. Significant computing time 
reductions are obtained with respect to previous 
safe rules. 


1. Introduction 

Since the mid 1990’s, high dimensional statistics has at¬ 
tracted considerable attention, especially in the context of 
linear regression with more explanatory variables than ob¬ 
servations: the so-called p > n case. In such a context, the 
least squares with i\ regularization, referred to as the Lasso 
(Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 
1998) in signal processing, has been one of the most pop¬ 
ular tools. It enjoys theoretical guarantees (Bickel et al., 
2009), as well as practical benefits: it provides sparse solu¬ 
tions and fast convex solvers are available. This has made 
the Lasso a popular method in modern data-science tool¬ 
kits. Among successful fields where it has been applied. 
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one can mention dictionary learning (Mairal, 2010), bio¬ 
statistics (Haury et al., 2012) and medical imaging (Lustig 
et al., 2007; Gramfort et al., 2012) to name a few. 

Many algorithms exist to approximate Lasso solutions, but 
it is still a burning issue to accelerate solvers in high di¬ 
mensions. Indeed, although some other variable selection 
and prediction methods exist (Fan & Lv, 2008), the best 
performing methods usually rely on the Lasso. For sta¬ 
bility selection methods (Meinshausen & Buhlmann, 2010; 
Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso 
problems need to be solved. For non-convex approaches 
such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), 
solving the Lasso is often a required preliminary step (Zou, 
2006; Zhang & Zhang, 2012; Candes et al., 2008). 

Among possible algorithmic candidates for solving the 
Lasso, one can mention homotopy methods (Osborne et al., 
2000), LARS (Efron et al., 2004), and approximate homo¬ 
topy (Mairal & Yu, 2012), that provide solutions for the full 
Lasso path, i.e., for all possible choices of tuning parame¬ 
ter A. More recently, particularly for p > n, coordinate 
descent approaches (Friedman et al., 2007) have proved to 
be among the best methods to tackle large scale problems. 

Following the seminal work by El Ghaoui et al. (2012), 
screening techniques have emerged as a way to exploit the 
known sparsity of the solution by discarding features prior 
to starting a Lasso solver. Such techniques are coined safe 
rules when they screen out coefficients guaranteed to be 
zero in the targeted optimal solution. Zeroing those coeffi¬ 
cients allows to focus more precisely on the non-zero ones 
(likely to represent signal) and helps reducing the computa¬ 
tional burden. We refer to (Xiang et al., 2014) for a concise 
introduction on safe rules. Other alternatives have tried to 
screen the Lasso relaxing the “safety”. Potentially, some 
variables are wrongly disregarded and post-processing is 
needed to recover them. This is for instance the strategy 
adopted for the strong rules (Tibshirani et al., 2012). 

The original basic safe rules operate as follows: one 
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chooses a fixed tuning parameter A, and before launching 
any solver, tests whether a coordinate can be zeroed or not 
(equivalently if the corresponding variable can be disre¬ 
garded or not). We will refer to such safe rules as static 
safe rules. Note that the test is performed according to a 
safe region, i.e., a region containing a dual optimal solu¬ 
tion of the Lasso problem. In the static case, the screening 
is performed only once, prior any optimization iteration. 
Two directions have emerged to improve on static strate¬ 
gies. 

• The first direction is oriented towards the resolution 
of the Lasso for a large number of tuning parameters. 
Indeed, practitioners commonly compute the Lasso 
over a grid of parameters and select the best one in a 
data-driven manner, e.g., by cross-validation. As two 
consecutive X's in the grid lead to similar solutions, 
knowing the first solution may help improve screen¬ 
ing for the second one. We call sequential safe rules 
such strategies, also referred to as recursive safe rules 
in (El Ghaoui et al., 2012). This road has been pur¬ 
sued in (Wang et al., 2013; Xu & Ramadge, 2013; Xi¬ 
ang et al., 2014), and can be thought of as a “warm 
start” of the screening (in addition to the warm start of 
the solution itself). When performing sequential safe 
rules, one should keep in mind that generally, only an 
approximation of the previous dual solution is com¬ 
puted. Though, the safety of the rule is guaranteed 
only if one uses the exact solution. Neglecting this is¬ 
sue, leads to “unsafe” rules: relevant variables might 
be wrongly disregarded. 

• The second direction aims at improving the screen¬ 
ing by interlacing it throughout the optimization algo¬ 
rithm itself: although screening might be useless at the 
beginning of the algorithm, it might become (more) 
efficient as the algorithm proceeds towards the opti¬ 
mal solution. We call these strategies dynamic safe 
rules following (Bonnefoy et al., 2014a;b). 

Based on convex optimization arguments, we leverage du¬ 
ality gap computations to propose a simple strategy uni¬ 
fying both sequential and dynamic safe rules. We coined 
GAP SAFE rules such safe rules. 


finite time the active variables of the optimal solution (or 
equivalently the inactive variables), and the tests become 
more and more precise as the optimization algorithm pro¬ 
ceeds. We also show that our new GAP SAFE rules, built 
on dual gap computations, are converging safe rules since 
their associated safe regions have a diameter converging to 
zero. We also explain how our GAP SAFE tests are se¬ 
quential by nature. Application of our GAP SAFE rules 
with a coordinate descent solver for the Lasso problem is 
proposed in Section 4. Using standard data-sets, we report 
the time improvement compared to prior safe rules. 


1.1. Model and notation 


We denote by [d] the set {1,..., d} for any integer deN. 
Our observation vector is y G R’" and the design matrix 
X = [xi, ■ ■ ■ , x p ] G R nxp has p explanatory variables (or 
features) column-wise. We aim at approximating y as a 
linear combination of few variables xf s, hence expressing 
y as X/3 where /3 G R p is a sparse vector. The standard 
Euclidean norm is written || • ||, the fj norm || ■ ||i, the 
norm || • ||oo, and the matrix transposition of a matrix Q is 
denoted by Q T . We denote (f) + = max(0, t). 


For such a task, the Lasso is often considered (see 
Buhl man n & van de Geer (2011) for an introduction). For a 
tuning parameter A > 0, controlling the trade-off between 
data fidelity and sparsity of the solutions, a Lasso estimator 

^(A) 

is any solution of the primal optimization problem 

( 5 ^ G argmin ^ ||X/3 — y\\ 2 2 + A H/31^ . (1) 

/3eR p £ _, 

=Px{0) 


Denoting Ax = {9 G M n : ^ 1, Vj G [p]} the 

dual feasible set, a dual formulation of the Lasso reads (see 
for instance Kim et al. (2007) or Xiang et al. (2014)): 


8 {X) = argmax )- \\y\\\ 
QeAx<^^ n ^_ 



=D X (0) 


2 

2 


( 2 ) 


We can reinterpret Eq. (2) as 9 < - x ' 1 = IIa x (p/A), where 
lie refers to the projection onto a closed convex set C. In 
particular, this ensures that the dual solution 0 iXi is always 
unique, contrarily to the primal /T A ). 


The main contributions of this paper are 1) the introduction 
of new safe rules which demonstrate a clear practical im¬ 
provement compared to prior strategies 2) the definition of 
a theoretical framework for comparing safe rules by look¬ 
ing at the convergence of their associated safe regions. 


1.2. A KKT detour 

For the Lasso problem, a primal solution G R p and the 
dual solution (f x> g R" are linked through the relation: 

y = A"/3 (a) + X9 W . (3) 


In Section 2, we present the framework and the basic con¬ 
cepts which guarantee the soundness of static and dynamic 
screening rules. Then, in Section 3, we introduce the new 
concept of converging safe rules. Such rules identify in 


The Karush-Khun-Tucker (KKT) conditions state: 


Vj G [p], xj8 {X) G 


{sign(/3j A) )} 

[- 1 , 1 ] 


if 

if 


* 0, 

4 (A) = o 










Mind the duality gap: safer rules for the Lasso 


Rule 

Center 

Radius 

Ingredients 

Static Safe (El Ghaoui et al., 2012) 

y/ A 

Rx(x^) 

A max 

Amax — \\X y co — 

Dynamic ST3 (Xiang et al., 2011) 

y/\ — Sxj * 

{Rx{9 k ) 2 - 5 2 )s 

8 = - 1) /Iky* f 

Dynamic Safe (Bonnefoy et al., 2014a) 

y/ A 

Rx(9 k ) 

0k 6 Ax (e.g; as in (11) ) 

Sequential (Wang et al., 2013) 

1 ) 

art “ R II^H 

exact §At-i) required 

GAP SAFE sphere (proposed) 

e k 

r \ t {fik, 9 k ) = x t ^2G\ t {f k ,9 k ) 

dual gap for 0 k ,9 k 


Table 1. Review of some common safe sphere tests. 


See for instance (Xiang et al., 2014) for more details. The 
KKT conditions lead to the fact that for A T A max = 
||X T y| 00 , 0 e R p is a primal solution. It can be consid¬ 
ered as the mother of all safe screening rules. So from now 
on, we assume that A T A max for all the considered A’s. 

2. Safe rules 

Safe rules exploit the KKT condition (4). This equation im¬ 
plies that /3j X ^ = 0 as soon as \xj9 ^ | < 1. The main chal¬ 
lenge is that the dual optimal solution is unknown. Hence, a 
safe rule aims at constructing a set C cz M n containing fA A ' . 
We call such a set C a safe region. Safe regions are all the 
more helpful that for many j’s, pc( x j) '■= su Peec I X J ^1 < 
1, hence for many j’s, /ij A) = 0. 

Practical benefits are obtained if one can construct a region 
C for which it is easy to compute its support function, de¬ 
noted by oc and defined for any x e R n by: 

crc(x) = max x T 9 . (5) 

6 >eC 

Cast differently, for any safe region C, any j e [p], and any 
primal optimal solution A X K the following holds true: 

If Pc{ x j) = max(a c ( x j), a c {-Xj)) < 1 then /)j A) = 0. 

( 6 ) 

We call safe test or safe rule, a test associated to C and 
screening out explanatory variables thanks to Eq. (6). 
Remark 1. Reminding that the support function of a set is 
the same as the support function of its closed convex hull 
(Hiriart-Urruty & Lemarechal, 1993) [Proposition V.2.2.1], 
we restrict our search to closed convex safe regions. 

Based on a safe region C one can partition the explanatory 
variables into a safe active set A X (C) and a safe zero set 
Z X (C) where: 

AW(C) = {je\p\:pt c (x j )>lh ( 7 ) 

Z^(C) = {j e [p]:pc( x j)< I}- (8) 

Note that for nested safe regions C\ cz C 2 then H/ A ) (Ci) cz 
A^(C 2 ). Consequently, a natural goal is to find safe re¬ 
gions as small as possible: narrowing safe regions can only 
increase the number of screened out variables. 


Remark 2. If C = {() tx> }, the safe active set is the equicor- 
relation set A^ X \C) = £\ := {j e \p\ : |xJ#( A )| = 1} (in 
most cases (Tibshirani, 2013) it is exactly the support of 
Even when the Lasso is not unique, the equicorrela- 
tion set contains all the solutions’ supports. The other ex¬ 
treme case is when C = Ax, andAl( A )(C) = \p\. Here, no 
variable is screened out: Z^ X \C) = 0 and the screening 
is useless. 

We now consider common safe regions whose support 
functions are easy to obtain in closed form. For simplicity 
we focus only on balls and domes, though more compli¬ 
cated regions could be investigated (Xiang et al., 2014). 

2.1. Sphere tests 

Following previous work on safe rules, we call sphere tests, 
tests relying on balls as safe regions. For a sphere test, one 
chooses a ball containing 0 iX> with center c and radius r, 
i.e., C = B(c , r). Due to their simplicity, safe spheres have 
been the most commonly investigated safe regions (see for 
instance Table 1 for a brief review). The corresponding test 
is defined as follows: 

If Ms(c,r)(^i) = \x]c\ + r\\xj\\ < 1, then /?j A) = 0. (9) 

Note that for a fixed center, the smaller the radius, the better 
the safe screening strategy. 

Example 1. The first introduced sphere test (El Ghaoui 
et al., 2012) consists in using the center c = y /A and radius 
r = 1 1/A — 1/A max ||y||. Given that 0 (A) = n Ax (t//A), 
this is a safe region since y/ A max e and ||t//A max — 
n Ajc(2//A)|| ^ Ml 11/A - 1/A max |. However, one can 
check that this static safe rule is useless as soon as 

a „ . ( 1 + kjyl/dl^llllyll)^ 

t - R mm -—^— . (10) 

^max ie[p] y 1 + A max/dl'Ej || \\y \\) J 

2.2. Dome tests 

Other popular safe regions are domes, the intersection be¬ 
tween a ball and a half-space. This kind of safe region has 
been considered for instance in (El Ghaoui et al., 2012; Xi¬ 
ang & Ramadge, 2012; Xiang et al., 2014; Bonnefoy et al.. 
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Figure 1. Representation of the dome D(c,r, a,w) (dark blue). 
In our case, note that a is positive. 


2014b). We denote D(c, r, a, w) the dome with ball center 
c, ball radius r, oriented hyperplane with unit normal vec¬ 
tor w and parameter a such that c — arw is the projection 
of c on the hyperplane (see Figure 1 for an illustration in 
the interesting case a > 0). 

Remark 3. The dome is non-trivial whenever a e [—1,1], 
When a = 0, one gets simply a hemisphere. 

For the dome test one needs to compute the support func¬ 
tion for C = D(c, r, a, w). Interestingly, as for balls, it can 
be obtained in a closed form. Due to its length though, the 
formula is deferred to the Appendix (see also (Xiang et al., 
2014)[Lemma 3] for more details). 

2.3. Dynamic safe rules 

For approximating a solution B I:X> of the Lasso primal 
problem P\, iterative algorithms are commonly used. We 
denote f3k e R p the current estimate after k iterations of 
any iterative algorithm (see Section 4 for a specific study 
on coordinate descent). Dynamic safe rules aim at discov¬ 
ering safe regions that become narrower as k increases. To 
do so, one first needs dual feasible points: 9 *. e A^. Fol¬ 
lowing El Ghaoui et al. (2012) (see also (Bonnefoy et al., 
2014a)), this can be achieved by a simple transformation of 
the current residuals Pk = y — X/3k, defining 9k as 

[ 9k &kPk ) 

= min[ma X (4^, • 

( 11 ) 

Such dual feasible 9k is proportional to pi ;: , and is the clos¬ 
est point (for the norm || • ||) to y/X in Ax with such a prop¬ 
erty, i.e., 9 k = n AxnSpan(pfc) (y/ A). A reason for choosing 
this dual point is that the dual optimal solution 9^ is the 
projection of y/X on the dual feasible set Ax, and the op¬ 
timal is proportional to y — cf. Equation (3). 

Remark 4. Note that if limfe_» +00 f3k = /T A - ) (convergence 
of the primal) then with the previous display and (3), we 
can show that lim/ c _ > + 0 o 9k = 9 X). Moreover, the conver¬ 


gence of the primal is unaltered by safe rules: screening 
out unnecessary coefficients of /3k, can only decrease the 
distance between /3k and /3^ x \ 

Example 2. Note that any dual feasible point 9 e Aj im¬ 
mediately provides a ball that contains 9^ since 


§w _ y 

= min 

9’~ V - 


9- V - 

A 

0'eAx 

A 


A 


( 12 ) 

The ball B(y/ A, R\(9k)) corresponds to the simplest safe 
region introduced in (Bonnefoy et al., 2014a;b) (cf. Figure 2 
for more insights). When the algorithm proceeds, one ex¬ 
pects that 9k gets closer to 9^ x \ so \\9k — y/ A|j should get 
closer to ||6A a ) — y/ Aj|. Similarly to Example 1, this dy¬ 
namic rule becomes useless once A is too small. More pre¬ 
cisely, this occurs as soon as 

A ( 1 + \xjy\/(\\xj\\\\y\\) \ 

Amax Mp] \A max ||£K A ) ll/llt/1| + Amax/dXjlllj/l) J 

(! 3 ) 

Noticing that ||6^ A )|| < ||y/A|| (since IIa x is a contraction 
and 0 e Ax) and proceeding as for (10), one can show that 
this dynamic safe rule is inefficient when: 


A 

-- mm 

Amax J^\p I 



(14) 


This is a critical threshold, yet the screening might stop 
even at a larger A thanks to Eq. (13). In practice the bound 
in Eq. (13) cannot be evaluated a priori due to the term 
||0l A l ||). Note also that the bound in Eq. (14) is close to the 
one in Eq. (10), explaining the similar behavior observed 
in our experiments (see Figure 3 for instance). 


3. New contributions on safe rules 

3.1. Support discovery in finite time 

Let us first introduce the notions of converging safe regions 
and converging safe tests. 

Definition 1. Let (Ck)ke n be a sequence of closed convex 
sets in R" containing fT A '. It is a converging sequence of 
safe regions for the Lasso with parameter A if the diameters 
of the sets converge to zero. The associated safe screening 
rules are referred to as converging safe tests. 

Not only converging safe regions are crucial to speed up 
computation, but they are also helpful to reach exact active 
set identification in a finite number of steps. More pre¬ 
cisely, we prove that one recovers the equicorrelation set of 
the Lasso ( cf. Remark 2) in finite time with any converg¬ 
ing strategy: after a finite number of steps, the equicor¬ 
relation set E\ is exactly identified. Such a property is 
sometimes referred to as finite identification of the support 
(Liang et al., 2014). This is summarized in the following. 
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(a) Location of the dual optimal (b) Refined location of the dual (c) Proposed GAP SAFE sphere (d) Proposed GAP SAFE dome 
§^ in the annulus. optimal 0^ (dark blue). (orange). (orange). 


Figure 2. Our new GAP SAFE sphere and dome (in orange). The dual optimal solution must lie in the dark blue region; /? is any 
point in R p , and 9 is any point in the dual feasible set Ax. Remark that the GAP SAFE dome is included in the GAP SAFE sphere, and 
that it is the convex hull of the dark blue region. 


Theorem 1. Let ( Ck)k<=N be a sequence of converging 
safe regions. The estimated support provided by Ck, 
= {j e [p] : maxg 6 c fc \9 J Xj\ 0 1}, satisfies 
limfc^oo (Ck) = £\, and there exists ^eN such that 
Mk ^ fc 0 one 8 ets A/ x \Ck) = £\- 


Hence, 




y J(\\y\\*-\\XI3-y\\ 2 -2\m\ 1 )_ 


(16) 


Proof The main idea of the proof is to use that 
limfc^oo Ck = {0 (A) }, limfc-,00 pc k ( x ) = F{§W}( X ) = 
|x T 0( A )| and that the set A) A l(Cfc) is discrete. Details are 
delayed to the Appendix. □ 

Remark 5. A more general result is proved for a spe¬ 
cific algorithm (Forward-Backward) in Liang et al. (2014). 
Interestingly, our scheme is independent of the algorithm 
considered ( e.g Forward-Backward (Beck & Teboulle, 
2009), Primal Dual (Chambolle & Pock, 2011), coordinate- 
descent (Tseng, 2001; Friedman et al., 2007)) and relies 
only on the convergence of a sequence of safe regions. 

3.2. GAP SAFE regions: leveraging the duality gap 

In this section, we provide new dynamic safe rules built on 
converging safe regions. 

Theorem 2. Let us take any (/ 3,6) e M p x A x . Denote 
■■= j{ \\y\\ 2 -\\Xfi-y\\ 2 -2X PH, := 

\\9 ~ y/X\\ , 9^ the dual optimal Lasso solution and 
fx(M) ■= \! Rx(0) 2 - RxW , then 

0 W e B(o,r x (0,O)y (15) 


In particular, this provides ||(L A ) — y/ A|| ^ R\((3). Com¬ 
bining (12) and (16), asserts that (A a ( belongs to the an¬ 
nulus A(y/\,Rx(0),Rx(P)) := {z e R n : R\{fi) < 
||2 — y/A|| < R\(9)} (the light blue zone in Figure 2). 

Remind that the dual feasible set Aj is convex, hence 
A.\- n B{y/X, R\{9)) is also convex. Thanks to (16), 
A x nB{y/X 1 R\{9))^ = A x n A(y/X, R\{9), R\{/3)), and 
then Ax n A{y/X, R\{9), R\(f))) is convex too. Hence, 
0(A) 

is inside the annulus A{y/X,R\{9),R\(/3)) and so 
is c A{y/X, R\(9), R\{/3)) by convexity (see 

Figure 2, (a) and Figure 2,(b)). Moreover, (L A ) is the 
point of [0,0l A l] which is closest to y/X. The farthest 
where 0 (x > can be according to this information would be 
if [0,(?( A 1] were tangent to the inner ball B{y/X, R\{/3)) 
and ||(?1 A ) — y/ A|[ = R\(f3). Let us denote 0j nt such a 
point. The tangency property reads ||0; nt — y/A|| = R\{/3) 
and (9 — 9 ln t) T {y/X — (9i nt ) = 0. Hence, with the later 
and the definition of R\{9), ||9 — y/ A|| 2 = ||0 — $i nt || 2 + 

- y/ Afand || 9 - 0 int || 2 = R x {9 ) 2 - R x {/3) 2 . 

Since by construction (L A ) cannot be further away from 9 
than 9 iut (again, insights can be gleaned from Figure 2), we 
conclude that 9^ e B{9 , {R\{9) 2 — R x {fi) 2 )^ 2 ). □ 


Proof The construction of the ball B{9, r x ((3 , 9)) is based 
on the weak duality theorem (cfi (Rockafellar & Wets, 
1998) for a reminder on weak and strong duality). Fix 
0 e A_y and /3 e R p , then it holds that 


1 

2 


IMI 2 


A^ 

2 




<i||X/3-t/|| 2 + A| 


Remark 6. Choosing /3 = 0 and 9 = y/ A max , then one 
recovers the static safe rule given in Example 1. 

With the definition of the primal (resp. dual) objective for 
P\{/3), (resp. D x (9)), the duality gap reads as G x {(3 , 9) = 
P\{fi) — D\ {9). Remind that if G\{(3,9) < e, then one has 
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P X {P) — P X (P^) < e, which is a standard stopping cri¬ 
terion for Lasso solvers. The next proposition establishes a 
connection between the radius r A (/3, 9) and the duality gap 

Proposition 1. For any (P,8) e K p x Ax, the following 
holds 


?x(P,9) 2 ^ r x (p, 9) 2 := ^G x (p,6). (17) 

Proof. Use the fact that R x {9) 2 = ||0 — y/A|| 2 and 

Rx{P) 2 S* (Ibll 2 - \\XP - y\\ 2 - 2A II/3IIJ/A 2 . □ 


If we could choose the “oracle” 8 = 8 ^ and p = p^ 
in (15) then we would obtain a zero radius. Since those 
quantities are unknown, we rather pick dynamically the 
current available estimates given by an optimization algo¬ 
rithm: P = p k and 8 = 9 k as in Eq. (11). Introducing GAP 
SAFE spheres and domes as below. Proposition 2 ensures 
that they are converging safe regions. 

GAP SAFE sphere: 


C k = B(8 k ,r x (p,8)). (18) 


GAP SAFE dome: 


Ck = D 


l + Sk R x (8 k ) 2 ( R X (flQ 
2 ’ 2 ’ \Rx(O k ) 



(19) 


Proposition 2. For any converging primal sequence 
{Pk)ke N> a nd dual sequence ( 8 k ) ks pj defined as in Eq. (11), 
then the GAP SAFE sphere and the GAP SAFE dome are 
converging safe regions. 


Proof. For the GAP SAFE sphere the result follows 
from strong duality. Remark 4 and Proposition 1 yield 
limfc^oo r x (/3 k , 8 k ) = 0, since lim^oo 8 k = 9 W and 
limfc^oo p k = pP' 1 . For the GAP SAFE dome, one can 
check that it is included in the GAP SAFE sphere, therefore 
inherits the convergence (see also Figure 2,(c) and (d)). D 

Remark 7. The radius r x (P k , 8 k ) can be compared 
with the radius considered for the Dynamic Safe rule 
and Dynamic ST3 (Bonnefoy et al., 2014a) respectively: 
R\(0 k ) = || 8 k - y/ A|| 2 and (R x (8 k ) 2 - <5 2 ) 1/2 , where 
<5 = (A m ax/A — 1)/ \\x,j* |j. We have proved that 

limfc^oo r x (P k , 8 k ) = 0, but a weaker property is satisfied 
by the two other radius: lim^oo R x (8 k ) = R^d^) = 
||-Ra(^ {A) ) - y/A|| 2 and lim fe ^ 00 (.R A (<?/ c ) 2 - A 2 ) 1 / 2 = 
(R x (8^) 2 - 8 2 ) 1 / 2 >0. 


3.3. GAP SAFE rules : sequential for free 

As a byproduct, our dynamic screening tests provide a 
warm start strategy for the safe regions, making our GAP 
SAFE rules inherently sequential. The next proposition 
shows their efficiency when attacking a new tuning param¬ 
eter, after having solved the Lasso for a previous A, even 
only approximately. Handling approximate solutions is a 
critical issue to produce safe sequential strategies: without 
taking into account the approximation error, the screening 
might disregard relevant variables, especially the one near 
the safe regions boundaries. Except for A max , it is unreal¬ 
istic to assume that one can dispose of exact solutions. 

Consider Ao = A max and a non-increasing sequence of 
T - 1 tuning parameters (A t ) te [ T _!] in (0, A max ). In prac¬ 
tice, we choose the common grid (Biihlmann & van de 
Geer, 201 1)[2.12.1]): A t = AoHT 154 /^- 1 ) (for instance 
in Figure 3, we considered 6 = 3). The next result controls 
how the duality gap, or equivalently, the diameter of our 
GAP SAFE regions, evolves from A t -i to A*. 

Proposition 3. Suppose that t > 1 and (P, 0) e R p x A a . 
Reminding r 2 t (/3, 9) = 2G Xt (P,8)/A 2 , the following holds 


rl(P,0) = 


K-l 
A + 


m 


At -1 


Xp - y 

A t 


( 20 ) 

( A ^i_l)|| 0 || 2 . 

At 


Proof. Details are given in the Appendix. 


□ 


This proposition motivates to screen sequentially as fol¬ 
lows: having (/3, 8) e x Ax such that G Xt l (/?, 9) < e, 
then, we can screen using the GAP SAFE sphere with cen¬ 
ter 8 and radius r A (/3, 9). The adaptation to the GAP SAFE 
dome is straightforward and consists in replacing 9 k , p k , A 
by 9 , P, A* in the GAP SAFE dome definition. 

Remark 8. The basic sphere test of (Wang et al., 2013) re¬ 
quires the exact dual solution 8 = $(**-*) for center, and 
has radius 11 /A* — 1/A*_ 1 1 ||j/||, which is strictly larger than 
ours. Indeed, if one has access to dual and primal opti¬ 
mal solutions atA t _i,i.e., (#,/?) = (^G*- 1 ), p^-Pf then 
r A t _iC 0 > 6 ') = 0, 9 = (y - XP)/ A t _i and 



since ||0|| < ||j/||/At-i for 8 = d^-P. 

Note that contrarily to former sequential rules (Wang et al., 
2013), our introduced GAP SAFE rules still work when one 
has only access to approximations of d^t-P, 
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4. Experiments 

4.1. Coordinate Descent 

Screening procedures can be used with any optimization 
algorithm. We chose coordinate descent because it is well 
suited for machine learning tasks, especially with sparse 
and/or unstructured design matrix X. Coordinate descent 
requires to extract efficiently columns of X which is typi¬ 
cally not easy in signal processing applications where X is 
commonly an implicit operator (e.g. Fourier or wavelets). 


Algorithm 1 Coordinate descent with GAP SAFE rules 


input X,y,e,K,f, (Xt) te[ T-i] 

Initialization: 


-V) ^max 

/ 3 X ° = 0 

for t e [T — 1] do 

/3 <— /3 At_1 (previous e-solution) 

for fee | K | do 

if k mod / = 1 then 

Compute 9 and C thanks to (11) and (18) or (19) 
Get A Xt (C) = {j e [p] : nc\xj) ^ 1} as in (7) 
if G\ t (/3, 9) ^ e then 


break 

for j e A Xt (C) do 

# ST(«,x) = 

threshold) 

output (/3 Xt ) te [T-l] 


x kj( x P-y) \ 

PTiF ' 

sign (a;) (|x| 


u) + 


(soft- 


We implemented the screening rules of Table 1 based on the 
coordinate descent in Scikit-learn (Pedregosa et ah, 2011). 
This code is written in Python and Cython to generate low 
level C code, offering high performance. A low level lan¬ 
guage is necessary for this algorithm to scale. Two im¬ 
plementations were written to work efficiently with both 
dense data stored as Fortran ordered arrays and sparse data 
stored in the compressed sparse column (CSC) format. Our 
pseudo-code is presented in Algorithm 1. In practice, we 
perform the dynamic screening tests every / = 10 passes 
through the entire (active) variables. Iterations are stopped 
when the duality gap is smaller than the target accuracy. 

The naive computation of 9k in (11) involves the compu¬ 
tation of ||.Y T ( Ofc|| x (pk being the current residual), which 
costs 0(np) operations. This can be avoided as one knows 
when using a safe rule that the index achieving the max¬ 
imum for this norm is in A Xt (C). Indeed, by construc¬ 
tion argmax ;g4 A,. (c) \xj0 k \ = argmax jeM \x]9 k \ = 
argmaXjg^ \xj p k \- In practice the evaluation of the dual 
gap is therefore not a 0(np) but 0{nq) where q is the size 
of A Xt (C). In other words, using screening also speeds up 


^ 3 

^ 5 

§>? 
~~ 9 
1 

S; 3 


^ 3 


S 3 
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Figure 3. Proportion of active variables as a function of A and the 
number of iterations K on the Leukemia dataset. Better strategies 
have longer range of A with (red) small active sets. 


the evaluation of the stopping criterion. 

We did not compare our method against the strong rules 
of Tibshirani et al. (2012) because they are not safe and 
therefore need complex post-processing with parameters to 
tune. Also we did not compare against the sequential rule 
of Wang et al. (2013) (e.g., EDDP) because it requires the 
exact dual optimal solution of the previous Lasso problem, 
which is not available in practice and can prevent the solver 
from actually converging: this is a phenomenon we always 
observed on our experiments. 

4.2. Number of screened variables 

Figure 3 presents the proportion of variables screened by 
several safe rules on the standard Leukemia dataset. The 
screening proportion is presented as a function of the num¬ 
ber of iterations K. As the SAFE screening rule of El 
Ghaoui et al. (2012) is sequential but not dynamic, for a 
given A the proportion of screened variables does not de¬ 
pend on K. The rules of Bonnefoy et al. (2014a) are more 
efficient on this dataset but they do not benefit much from 
the dynamic framework. Our proposed GAP SAFE tests 
screen much more variables, especially when the tuning pa¬ 
rameter A gets small, which is particularly relevant in prac¬ 
tice. Moreover, even for very small A’s (notice the logarith¬ 
mic scale) where no variable is screened at the beginning 
of the optimization procedure, the GAP SAFE rules man¬ 
age to screen more variables, especially when K increases. 
Finally, the figure demonstrates that the GAP SAFE dome 
test only brings marginal improvement over the sphere. 

















Mind the duality gap: safer rules for the Lasso 


5- 

4- 

2 3- 
<u 

.1 2 - 

H 

1- 
0- 

Figure 4. Time to reach convergence using various screening rules 
on the Leukemia dataset (dense data: n = 72, p = 7129). 
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4.3. Gains in the computation of Lasso paths 

The main interest of variable screening is to reduce com¬ 
putation costs. Indeed, the time to compute the screening 
itself should not be larger than the gains it provides. Hence, 
we compared the time needed to compute Lasso paths to 
prescribed accuracy for different safe rules. Figures 4, 5 
and 6 illustrate results on three datasets. Figure 4 presents 
results on the dense, small scale. Leukemia dataset. Fig¬ 
ure 5 presents results on a medium scale sparse dataset 
obtained with bag of words features extracted from the 
20newsgroup dataset (comp.graphics vs. talk.religion.misc 
with TF-IDF removing English stop words and words oc¬ 
curring only once or more than 95% of the time). Text 
feature extraction was done using Scikit-Leam. Figure 6 
focuses on the large scale sparse RCV1 (Reuters Corpus 
Volume 1) dataset, cf. (Schmidt et ah, 2013). 

In all cases, Lasso paths are computed as required to esti¬ 
mate optimal regularization parameters in practice (when 
using cross-validation one path is computed for each fold). 
For each Lasso path, solutions are obtained for T = 100 
values of A’s, as detailed in Section 3.3. Remark that the 
grid used is the default one in both Scikit-Learn and the 
glmnet R package. With our proposed GAP SAFE screen¬ 
ing we obtain on all datasets substantial gains in computa¬ 
tional time. We can already get an up to 3x speedup when 
we require a duality gap smaller than 1 (T '. The inter¬ 
est of the screening is even clearer for higher accuracies: 
GAP SAFE sphere is 1 lx faster than its competitors on the 
Leukemia dataset, at accuracy 10 -8 . One can observe that 
with the parameter grid used here, the larger is p compared 
to n, the higher is the gain in computation time. 

In our experiments, the other safe screening rules did not 
show much speed-up. As one can see on Figure 3, those 
screening rules keep all the active variables for a wide range 
of A’s. The algorithm is thus faster for large A’s but slower 
afterwards, since we still compute the screening tests. Even 
if one can avoid some of these useless computations thanks 
to formulas like (14) or (10), the corresponding speed-up 
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Figure 5. Time to reach convergence using various screening rules 
on bag of words from the 20newsgroup dataset (sparse data: with 
n = 961,p = 10094). 
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Figure 6. Computation time to reach convergence using different 
screening strategies on the RCV1 (Reuters Corpus Volume 1) 
dataset (sparse data with n — 20242 and p = 47236). 


would not be significant. 

5. Conclusion 

We have presented new results on safe rules for accelerat¬ 
ing algorithms solving the Lasso problem (see Appendix 
for extension to the Elastic Net). First, we have introduced 
the framework of converging safe rules, a key concept in¬ 
dependent of the implementation chosen. Our second con¬ 
tribution was to leverage duality gap computations to create 
two safer rules satisfying the aforementioned convergence 
properties. Finally, we demonstrated the important practi¬ 
cal benefits of those new rules by applying them to standard 
dense and sparse datasets using a coordinate descent solver. 
Future works will extend our framework to generalized lin¬ 
ear model and group-Lasso. 
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A. Supplementary materials 

We provided in this Appendix some more details on the theoretical results given in the main part. 

A.l. Dome test 

Let us consider the case where the safe region C is the dome V(c, r, a, w), with parameters: center c, radius r, relative 
distance ratio a and unit normal vector w. 

The computation of the dome test formula proceeds as follows: 


and so 


cr c (Xj) = 


\-c' Xj + r\\Xj\ 

<rc{~Xj) = { T , 


dome test as: 


Using the former notation: 


3A) 


M =J 1 _r i ^ 1 

A -i 


— 1 + r 


ifw T Xj < — al a;,- . 

(21) 

a 2 ) otherwise. 

if — w T Xj < —alia;. 

^ (22) 

a 2 ) otherwise. 

-Xj)). Thanks to the Eq. (6), 

we express our 

= 0. 

(23) 

if w T Xj < —al a:.; L 

(24) 

) otherwise. 

if — w T Xj < — a\\xj , 

(25) 

) otherwise. 


Let us introduce the following dome parameters, for any 9 e A^: 


• Center: c = (y/X + 9)/ 2. 

• Radius: r = R x (9)/ 2. 

• Ratio: a = -1 + 2R X (9) 2 / R x (9) 2 . 

• Normal vector: w = ( y/X — 9)/R x (9). 


Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty 
& Lemarechal, 1993) [Proposition V.2.2.1] means that we only need to optimize over the dome introduced. Therefore, 
one cannot improve our previous result by optimizing the problem on the intersection of the ball of radius R x (9) and the 
complement of the ball of radius R x (/3) (i.e., the blue region in Figure 2). 


A.2. Proof of Theorem 1 

Proof. Define maxj^£ A |a:JfT A )| = t < 1. Fixe > Osuchthate < (1 — t)/(maxjg£ A ll^j ID- As Ck is a converging sequence 
containing 9^, its diameter is converging to zero, and there exists fco e N such that Vfc ^ fco, V0 e Ck, ||9 — 0( A )|| ^ e. 
Hence, for any j $ E x and any 9 e Ck, \xj (9 — (T A ))| < (max^^ ||)||0 — 9^ || < (ma||:Ej||)e. Using the triangle 
inequality, one gets 


\x]9\ =g(max ||x,-||)e + max |a;7^ A )| 

3 i$£x 3 jf£x 3 
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provided that e < (1 — t)/(max^£ A ll^jll)- Thus, for any k ^ ko,£ £ cz Z^(Ck) = A^ x \Ck) c and A^(Ck) c £ A . 

For the reverse inclusion take j e £\, i.e., = 1. Since for all k e N,0( A ) e Ck , then j e A^(Ck) = {j e [p] : 

ma xgeCk |xj0| ^ 1} and the result holds. 

□ 


A.3. Proof of Proposition 3 

We detail here the proof of Proposition 3. 

Proof. We first use the fact that 


G Xt _ 1 {^6)= l -\\X^-y\\l + \ t . l \ 




At-i 


to obtain 




\2 

A t -1 


0 — 


't-l 


+ G At _ 1 (/3,0)). 


Then, 


Gx t (13,9) = ± \\Xfi - y\\l + ^ (\ \\y\\l - ± \\XP y\\ 


\ 2 

I 2 _ At ~ l 
2 2 


9 — 


At-i 


+ G At _ 1 (/3,(9)) 


> 2 + y 


<9- 


At 


= * (y^ - 1) II&II 2 + J(! - 3 T-) 11*0 - 2/112 + t —Gx t _AP, 0) + l( Me y\\l - ||A t _i0 - y|| 

Z At—1 Z At—1 At—1 Z At—1 

-1) II 2 /II 2 + Ja - ir-) 11*0 ~ y\\l + t —GxMP, o) 

Z At —l Z At—l At —l 


+ 


^ (11^0 — y\\ 


2 - _ 2 /II 2 + ll(At-i - A t )0|l2 + 2 (A t9 - t/) T (A t _i - A t 


At-i 


We deal with the dot product as 


2At(A*_i — A t){0 — —) T # — At(At_i — At)( ||0|| 2 + 

a* 


0-1. 

2 

2 / 

At 

2 

At 


)• 


Hence, 


Gx t (M) =\(P~ ~ 1 + y~(At-i - A t )) Ml + t(! - ir-) 11*0 - 2 /II 2 

Z At_i At—1 Z At—1 

—y(At-i ^ A t )) ||0||2 + ~(1 - y““- t —(At —1 - A t )) ||A t 0 - y\\l + ——G At _ 1 (/3,0) 

Z Z At-l At_i At-l 

= ^ (i - 11*0 - V\\l - y (At-i - A t ))|0|| 2 + ^-Gx t _M0). 


We observe in the end that 
2 


A? G At 0M) = ( 1 -^ 


*0-2/ 

At 


At-i 


- 1 


I 2 + y T-Gx t _ 1 (/3,9). 

At-lAf 


□ 
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A.4. Elastic-Net 

The previously proposed tests can be adapted straightforwardly to the Elastic-Net estimator (Zou & Hastie, 2005). We 
provide here some more details for the interested reader. 


1 

min - 

0em.p 2 


WXP-yWt + Xam, 




^ ml 


One can reformulate this problem as a Lasso problem 


. 1 
mm - 

/3eRp 2 


X/3 - y 


+ Act H/3II,, 


(26) 


(27) 


where X = ( . ) e R n+p,p and y = ( ^ ] e R” +p . With this modification all the tests introduced for the 

VV(! - a)XI p J y V 0 / 

Lasso can be adapted for the Elastic-Net. 








